Following the SimpleTidy GeneCoEx workflow created by Dr. Chenxin Li for the analysis of co-expression networks.

Importing the necessary libraries for the analysis:

Importing the TPM gene expression matrix. It was generated by running Salmon on the RNA-seq data from the Kremling et al. (2018) paper and using the index of Maize v5 as the reference transcriptome. The matrix was filtered by TPM and coefficient of variation (CV) values.

Generating the Tidy data frame for the gene expression matrix:

Generating a wide version again (with log TPM values):

Importing metadata:

Generating a PCA:

Adding metadata to the PCA matrix:

Importing the correlation matrices. I (RACS) generated two different networks:

Later, a single network will also be generated, including both conditions.

Importing the two (day, night) networks:

We don’t really need duplicated information contained in the lower triangle of the matrix, so we can set it to NA:

For a initial edge selection, I (RACS) will use a correlation threshold of 0.5 (p-values and corrected p-values generated by corALS, so I will probably look at these values later):

dim(day_edge_table_select)
[1] 72637     3
dim(night_node_table)
[1] 3401    2

Creating the networks:

Clustering the networks:

For selection of the bait genes, we are probably going to use circadian genes, since we are comparing day and night periods.

Major factors in the experiment:

#Counting samples of different maize subpopulations
sample_metadata %>% 
  group_by(Subpopulation) %>% 
  count()
#Counting samples of different day periods
sample_metadata %>% 
  group_by(DayPeriod) %>% 
  count()
---
title: "Co-expression network analysis"
output: html_notebook
---

Following the [SimpleTidy GeneCoEx workflow](https://github.com/cxli233/SimpleTidy_GeneCoEx) created by Dr. Chenxin Li for the analysis of co-expression networks.

Importing the necessary libraries for the analysis:

```{r}
library(tidyverse)
#install.packages("tidyverse")
library(igraph)
library(ggraph)
#install.packages("ggraph")
library(readxl)
#install.packages("readxl")
library(patchwork)
#install.packages("patchwork")
library(RColorBrewer)
library(viridis)
```

Importing the TPM gene expression matrix. It was generated by running Salmon on the RNA-seq data from the Kremling et al. (2018) paper and using the index of Maize v5 as the reference transcriptome. The matrix was filtered by TPM and coefficient of variation (CV) values.

```{r}
Exp_table <- read_tsv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/poster_presentation_notebooks/plant_center_retreat_2024/kremling_expression_v5_tpm_filtered_cv_filtered.tsv", col_types = cols())
Exp_table
```

Generating the Tidy data frame for the gene expression matrix:

```{r}
Exp_table_long <- Exp_table %>% 
  pivot_longer(cols = !Name, names_to = "SampleName", values_to = "TPM") %>% 
  mutate(logTPM = log10(TPM + 1))

head(Exp_table_long)
```

Generating a wide version again (with log TPM values):

```{r}
Exp_table_log_wide <- Exp_table_long %>% 
  select(Name, SampleName, logTPM) %>% 
  pivot_wider(names_from = SampleName, values_from = logTPM)

head(Exp_table_log_wide)
```

Importing metadata:

```{r}
sample_metadata <- read_tsv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/poster_presentation_notebooks/pag_2025/sample_annotation.txt",
                            show_col_types = FALSE)
head(sample_metadata)
```

Generating a PCA:

```{r}
my_pca <- prcomp(t(Exp_table_log_wide[, -1]))
pc_importance <- as.data.frame(t(summary(my_pca)$importance))
head(pc_importance, 20)
```

```{r}
head(as.data.frame(my_pca$x[, 1:10]))
```

Adding metadata to the PCA matrix:

```{r}
PCA_coord <- my_pca$x[, 1:10] %>% 
  as.data.frame() %>% 
  mutate(SampleName = row.names(.)) %>% 
  full_join(sample_metadata %>% 
              select(Genotype, Subpopulation, DayPeriod, Tissue, SampleName), by = "SampleName")

head(PCA_coord)
```

```{r}
PCA_by_day_period <- PCA_coord %>% 
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point(aes(fill = DayPeriod), color = "grey20", shape = 21, size = 3, alpha = 0.8) +
  scale_fill_manual(values = brewer.pal(n = 3, "Accent")) +
  labs(x = paste("PC1 (", pc_importance[1, 2] %>% signif(3)*100, "% of Variance)", sep = ""), 
       y = paste("PC2 (", pc_importance[2, 2] %>% signif(3)*100, "% of Variance)", "  ", sep = ""),
       fill = NULL) +  
  theme_bw() +
  theme(
    text = element_text(size= 14),
    axis.text = element_text(color = "black")
  )

PCA_by_day_period
```

```{r}
PCA_by_subpopulation <- PCA_coord %>% 
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point(aes(fill = Subpopulation), color = "grey20", shape = 21, size = 3, alpha = 0.8) +
  labs(x = paste("PC1 (", pc_importance[1, 2] %>% signif(3)*100, "% of Variance)", sep = ""), 
       y = paste("PC2 (", pc_importance[2, 2] %>% signif(3)*100, "% of Variance)", "  ", sep = ""),
       fill = NULL) +  
  theme_bw() +
  theme(
    text = element_text(size= 14),
    axis.text = element_text(color = "black")
  )

PCA_by_subpopulation
```

Importing the correlation matrices. I (RACS) generated two different networks:

 * Day
 * Night
 
Later, a single network will also be generated, including both conditions.

Importing the two (day, night) networks:

```{r}
day_cor_matrix <- as.data.frame(read_tsv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/poster_presentation_notebooks/pag_2025/co_exp_day_pearson_correlation.tsv", col_types = cols()))
rownames(day_cor_matrix) <- day_cor_matrix[, 1]
day_cor_matrix <- day_cor_matrix[, -1]
night_cor_matrix <- as.data.frame(read_tsv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/poster_presentation_notebooks/pag_2025/co_exp_night_pearson_correlation.tsv", col_types = cols()))
rownames(night_cor_matrix) <- night_cor_matrix[, 1]
night_cor_matrix <- night_cor_matrix[, -1]
```

We don't really need duplicated information contained in the lower triangle of the matrix, so we can set it to NA:

```{r}
day_cor_matrix_tri <- day_cor_matrix
day_cor_matrix_tri[lower.tri(day_cor_matrix_tri)] <- NA
head(day_cor_matrix_tri)
night_cor_matrix_tri <- night_cor_matrix
night_cor_matrix_tri[lower.tri(night_cor_matrix_tri)] <- NA
head(night_cor_matrix_tri)
```

For a initial edge selection, I (RACS) will use a correlation threshold of 0.5 (p-values and corrected p-values generated by corALS, so I will probably look at these values later):

```{r}
# Converting the adjacency matrix to an edge table
day_edge_table <- day_cor_matrix_tri %>% 
  as.data.frame() %>% 
  mutate(from = row.names(day_cor_matrix)) %>%
  pivot_longer(cols = !from, names_to = "to", values_to = "r") %>%
  filter(is.na(r) == F) %>%
  filter(from != to)

head(day_edge_table)
dim(day_edge_table)

# Filtering edges with correlation coefficient >= 0.5
day_edge_table_select <- day_edge_table %>% 
  filter(r >= 0.5)

head(day_edge_table_select)
dim(day_edge_table_select)

# Converting the adjacency matrix to an edge table
night_edge_table <- night_cor_matrix_tri %>% 
  as.data.frame() %>% 
  mutate(from = row.names(night_cor_matrix)) %>%
  pivot_longer(cols = !from, names_to = "to", values_to = "r") %>%
  filter(is.na(r) == F) %>%
  filter(from != to)

head(night_edge_table)
dim(night_edge_table)

# Filtering edges with correlation coefficient >= 0.5
night_edge_table_select <- night_edge_table %>% 
  filter(r >= 0.5)

head(night_edge_table_select)
dim(night_edge_table_select)
```



```{r}
maize_funct_anno <- read_delim("/home/rsantos/Repositories/maize_microbiome_transcriptomics/poster_presentation_notebooks/plant_center_retreat_2024/kremling_expression_v5_tpm_filtered_cv_filtered_annotations.txt", delim = "\t", col_names = F, col_types = cols())
head(maize_funct_anno)

day_node_table <- data.frame(
  Name = c(day_edge_table_select$from, day_edge_table_select$to) %>% unique()
) %>% 
  left_join(maize_funct_anno, by = c("Name"="X1")) %>% 
  rename(functional_annotation = X2)
dim(day_node_table)

night_node_table <- data.frame(
  Name = c(night_edge_table_select$from, night_edge_table_select$to) %>% unique()
) %>% 
  left_join(maize_funct_anno, by = c("Name"="X1")) %>% 
  rename(functional_annotation = X2)
dim(night_node_table)
```

Creating the networks:

```{r}
day_network <- graph_from_data_frame(
  day_edge_table_select,
  vertices = day_node_table,
  directed = F
)

night_network <- graph_from_data_frame(
  night_edge_table_select,
  vertices = night_node_table,
  directed = F
)
```

Clustering the networks:

```{r}
day_net_modules <- cluster_leiden(day_network, resolution_parameter = 2,
                          objective_function = "modularity")
night_net_modules <- cluster_leiden(night_network, resolution_parameter = 2, 
                          objective_function = "modularity")
```

```{r}
optimize_resolution <- function(network, resolution){
  modules = network %>% 
    cluster_leiden(resolution_parameter = resolution,
                   objective_function = "modularity")
  
  parsed_modules = data.frame(
    gene_ID = names(membership(modules)),
    module = as.vector(membership(modules)) 
    )
  
  num_module_5 = parsed_modules %>% 
    group_by(module) %>% 
    count() %>% 
    arrange(-n) %>% 
    filter(n >= 5) %>% 
    nrow() %>% 
    as.numeric()
  
  num_genes_contained = parsed_modules %>% 
    group_by(module) %>% 
    count() %>% 
    arrange(-n) %>% 
    filter(n >= 5) %>% 
    ungroup() %>% 
    summarise(sum = sum(n)) %>% 
    as.numeric()
  
  c(num_module_5, num_genes_contained)

}
```


```{r}
optimization_results <- purrr::map_dfc(
  .x = seq(from = 0.25, to = 10, by = 0.25),
  .f = optimize_resolution, 
  network = day_network
) %>% 
  t() %>% 
  cbind(
   resolution = seq(from = 0.25, to = 10, by = 0.25)
  ) %>% 
  as.data.frame() %>% 
  rename(num_module = V1,
         num_contained_gene = V2)

optimization_results
```

```{r}
Optimize_num_module <- optimization_results %>% 
  ggplot(aes(x = resolution, y = num_module)) +
  geom_line(size = 1.1, alpha = 0.8, color = "dodgerblue2") +
  geom_point(size = 3, alpha = 0.7) +
  geom_vline(xintercept = 6.5, size = 1, linetype = 4) +
  labs(x = "resolution parameter",
       y = "num. modules\nw/ >=5 genes") +
  theme_classic() +
  theme(
    text = element_text(size = 14),
    axis.text = element_text(color = "black")
  )

Optimize_num_gene <- optimization_results %>% 
  ggplot(aes(x = resolution, y = num_contained_gene)) +
  geom_line(size = 1.1, alpha = 0.8, color = "violetred2") +
  geom_point(size = 3, alpha = 0.7) +
  geom_vline(xintercept = 6.5, size = 1, linetype = 4) +
  labs(x = "resolution parameter",
       y = "num. genes in\nmodules w/ >=5 genes") +
  theme_classic() +
  theme(
    text = element_text(size = 14),
    axis.text = element_text(color = "black")
  )

wrap_plots(Optimize_num_module, Optimize_num_gene, nrow = 2)
```


For selection of the bait genes, we are probably going to use circadian genes, since we are comparing day and night periods.


```{r}

```

Major factors in the experiment:

```{r}
#Counting samples of different maize subpopulations
sample_metadata %>% 
  group_by(Subpopulation) %>% 
  count()
```

```{r}
#Counting samples of different day periods
sample_metadata %>% 
  group_by(DayPeriod) %>% 
  count()
```

